# Multimodal input

Mistral Small 3.2 24B Instruct 2506 GGUF

Mistral Small 3.2 24B Instruct 2506 is a multilingual large language model that accepts text and image input, produces text output, and has a context length of 128k tokens.

Image-to-Text Supports Multiple Languages

lmstudio-community

Gemma 3n E2B It

Gemma 3n is a family of lightweight, open multimodal models from Google, built on the same research and technology as the Gemini models. It supports text, audio, and visual inputs.

Qwen2.5 Omni 7B GGUF

Qwen2.5-Omni-7B-GGUF is the GGUF format version of the Qwen2.5-Omni-7B model, supporting multimodal inputs including text, audio, and images.

Large Language Model English

Qwen2.5 Omni 3B GGUF

Qwen2.5-Omni-3B is a multimodal model that supports text, audio, and image input, but does not support video input or audio generation.

Large Language Model English

DAM-3B-Video is a 3-billion-parameter vision-language model capable of generating fine-grained local descriptions for user-specified image/video regions.

Safetensors English

Stable Diffusion 3.5 Large Controlnet Canny

A Canny edge-detection ControlNet adapted for the Stable Diffusion 3.5 Large model, used for precise control of the image generation process.

Image Generation English

The first DiT-based video generation model capable of real-time generation of high-quality videos, supporting two scenarios: text-to-video and image + text-to-video.

Text-to-Video English

Diva Llama 3 V0 8b

DiVA Llama 3 is an end-to-end voice assistant model capable of processing both speech and text inputs, trained using distillation loss.

© 2025 AIbase